To cluster or not to cluster

is that the question?

So before I start with my considerations on clustering, a clarification is in order: I am not an expert on the topic. There are interesting papers in the literature that offer clear advice on when to cluster standard errors and when not to (see Abadie et al. 2017).

Very recently, in fact (at the time of writing, it is March 12, 2021), Prof. Wooldridge provided some strong advice on the use of clusters, suggesting that, at the very least, one should report robust standard errors. This is an option available in most econometric software (including my favorite...Stata).

So what I am going to do here is something different. Rather than trying to provide you with math-heavy material justifying and proving when one should "cluster" data, I'll provide simulation evidence on a very specific question:

What happens when you do not cluster, but you should?

Because the simulation evidence here is tied to the assumptions of the data generating process, some of the conclusions may not generalize.

Nevertheless, I believe you can use this evidence to learn a few things. Ok, so let's start.

Simulation Setup

When thinking about clustering standard errors, the most important example that comes to mind is panel data.

Panel data is a special type of data that micro- and macroeconomists have access to. The most important characteristic of panel data is that a set of individuals $i$ is followed across time $t$. Because this data provides a picture of information across individuals, but also across time, it is possible to control for individual unobserved characteristics, as long as they are fixed across time for each individual.

This structure, however, is not unique to panel data as described before. You can have a similar structure if you collect, say, data for family members (the $i$) and their families (the $t$), or counties (the $i$) and states (the $t$).

Furthermore, despite the recent discussion regarding the identification of models with multiple fixed effects, you can consider even more complex data structures. For example, the one used in Abowd et al. (2008), where individuals are followed across time (a standard panel) but are also observed transitioning across firms, which become a third dimension of the fixed effects.

So back to the problem. Let's consider a very simple panel structure where the DGP is defined as: $$ y_{it} = a_0 + a_1 x_{it} + a_2 z_{it} + e_i + v_{it} \quad (1) $$ where $e_i$ are individual unobserved (but time-fixed) factors, and $v_{it}$ are idiosyncratic errors that differ across individuals and time (also unobserved).

To keep things simple, I will assume that both errors are uncorrelated with the explanatory variables $x_{it}$ and $z_{it}$, so we can treat them as exogenous. We can assume that $x$ varies across time, but that $z$ is time-fixed. For simplicity, I'll also assume a balanced dataset, so that all N individuals are observed for T periods. Under this structure, we can simulate the data as follows:
. * Consider 100 individuals:
. set obs 100
number of observations (_N) was 0, now 100
. * For which I'll create an identifier.
. gen id=_n
. * And I will assume that each individual is observed for 10 periods
. expand 10
(900 observations created)
. * Lets assume that X and Z are independent from each other.
. * but that both are "normal"
. gen x = rnormal()
. gen z = rnormal()
. * And Z is fixed for each individual
. bysort id:replace z=z[1]
(900 real changes made)
. * I use the same code for the unobserved factors. 
. gen e = rnormal()
. gen v = rnormal()
. * And e is fixed for each individual
. bysort id:replace e=e[1]
(900 real changes made)
. * Finally, we construct our dependent variable Y
. gen y=1+x+z+e+v
For this simple DGP, we will contrast the results from 5 different estimation procedures:
1. Pooled OLS with default standard errors.
2. Pooled OLS with robust standard errors.
3. Pooled OLS with standard errors clustered at the individual level.
4. The panel fixed effects (FE) estimator.
5. The panel random effects (RE) estimator.
You may notice that options 1, 2, and 3 do not control for the individual fixed effect explicitly. Option 3, however, accounts for it through the construction of the standard errors, by assuming that errors within a person are correlated (the individual effect is, after all, fixed over time). Options 4 and 5 instead deal with the effect in the estimation itself: explicitly in the case of FE, and implicitly in the case of RE. In Stata terms, these correspond to the commands sketched below.
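Once the data above is in memory, the five options boil down to the following commands (the xtreg commands additionally require the data to be -xtset- by id; later I will call all of these through a small wrapper program instead of typing them directly):
. reg y x z
. reg y x z, robust
. reg y x z, cluster(id)
. xtreg y x z, fe
. xtreg y x z, re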

Something else you may notice: because $x$ and $z$ are exogenous (independent of the errors), any of these strategies will provide unbiased estimates. However, not all of them will produce correct standard errors.

But how can we see that? How can we know that the coefficients will be unbiased, but the standard errors may be incorrect?
We run many simulations and summarize the results!

So we turn to more programming. First, I need to create a program that simulates the data (as before) and estimates the desired model (from the list above).
Second, I need to repeat the process a large number of times (usually tens of thousands), collect the results from each estimation, and make the assessment.

Let's start with the program, recycling the code I had before (without the comments)
. * I'm writing this as an "eclass" estimator, so it is easy to collect the estimation results.
. program panel_re, eclass
  1. * a new line. "clear" to start from a clean dataset each time the program is called:
.         clear
  2.         set obs 100
  3.         gen id=_n
  4.         expand 10
  5.         gen x = rnormal()
  6.         gen z = rnormal()
  7.         sort id, stable
  8.         by id:replace z=z[1]
  9.         gen e = rnormal()
 10.         gen v = rnormal()
 11.         by id:replace e=e[1]
 12.         gen y=1+x+z+e+v
* here I estimate the model
* I will use two "locals": 1 and 2. 
* This will be used as place holders for the model
* and options for the estimation.
* I'll also set this data as panel data (it will be needed for some cases)
 13.         xtset id
 14.         `1' y x z, `2'
 15.         ** and that is it! 
. end
Alright, the structure of the program is done, including the two placeholders "1" and "2", which I have to provide from the command line. For example, if I wanted to run the estimation using a simple OLS model with clustered standard errors, I could do the following:
. *panel_re reg  cluster(id)
. *          ^       ^
. *          |       |
. *         '1'     '2'
. *Here reg will be argument 1
. *and cluster(id) argument 2
. panel_re reg  cluster(id)
number of observations (_N) was 0, now 100
(900 observations created)
(900 real changes made)
(900 real changes made)
       panel variable:  id (balanced)

Linear regression                               Number of obs     =      1,000
                                                F(2, 99)          =     355.71
                                                Prob > F          =     0.0000
                                                R-squared         =     0.4973
                                                Root MSE          =     1.4754

                                   (Std. Err. adjusted for 100 clusters in id)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   1.033408   .0409937    25.21   0.000     .9520676    1.114748
           z |   .9983844   .0895856    11.14   0.000     .8206271    1.176142
       _cons |   .9965414   .1162236     8.57   0.000     .7659286    1.227154
------------------------------------------------------------------------------
Next step, actually performing the simulations.

Robust, cluster, FE, etc

Ok, so now that the program has been set up, we can run the simulation a number of times to see what happens when clusters are used or ignored in the estimation of this model.
While it is usually recommended to run 10,000 iterations to properly assess the results, I'll only use 100, as that is enough to see the patterns. You are, of course, welcome to use more repetitions and see what happens.

For the simulations, we will use the Stata command -simulate-, which makes this type of analysis simple. And I'll use seed(101) to make the exercise replicable.

So let me start with the simple OLS regression with default standard errors. I'll gather both the point estimates and the standard errors from each replication:
. simulate _b _se, reps(100) seed(101):panel_re reg  

      command:  panel_re reg

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    1.000355    .0381124   .8919857   1.085083
        _b_z |        100    .9876579    .0994754   .7406453     1.2002
     _b_cons |        100    .9984308    .1107775   .7436068   1.238203
-------------+---------------------------------------------------------
       _se_x |        100    .0444944    .0023557   .0397996   .0505446
       _se_z |        100    .0454697    .0039679   .0382312   .0611115
    _se_cons |        100    .0446636    .0022492   .0396397   .0515558
The first thing to see here is that, as expected, OLS provides unbiased estimates. The coefficients for $x$ and $z$ are almost identical to the population values of 1, and similarly for the constant.

For the standard errors, we now have two estimates: the standard deviation of the simulated coefficients (_b), which we should consider the correct benchmark, and the average of the estimated standard errors (_se).

What I get from these results is that the standard error for $x$, the time-varying variable, seems to be estimated correctly. We see this because the "average" standard error for x (_se_x) is close to the standard deviation of _b_x.
This, however, does not happen with the time-constant variable $z$, or with the constant. In those cases, OLS provides a standard error estimate that is about half of the simulation-based one.
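If you want to put a number on this gap, here is a quick sketch (assuming the simulation results above are still in memory):
. * compare the average estimated SE with the simulation-based SD;
. * a ratio near 1 means the standard errors are about right
. quietly sum _b_z
. local sd_bz = r(sd)
. quietly sum _se_z
. display "avg SE / simulation SD for z: " %5.3f r(mean)/`sd_bz'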
Why is that bad?
Because we wouldn't be able to do proper statistical inference with this model.
For example, if you assess how often we reject the null hypothesis $H_0:\beta_z = 1$, you will find that we do so "too often".
You can see that by doing the following:
. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)

. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)

. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)

. sum is*

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     is_bx_1 |        100         .02    .1407053          0          1
     is_bz_1 |        100         .43    .4975699          0          1
     is_bc_1 |        100         .42     .496045          0          1
So using OLS, we reject the null for $x$ 2% of the time (about what you would expect), but we reject the null 43% and 42% of the time for $z$ and the constant, respectively.

What about if we use robust standard errors?
They would probably have little impact, since the DGP here is homoskedastic, but it wouldn't hurt to try:
. simulate _b _se, reps(100) seed(101):panel_re reg robust 

      command:  panel_re reg robust

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    1.000355    .0381124   .8919857   1.085083
        _b_z |        100    .9876579    .0994754   .7406453     1.2002
     _b_cons |        100    .9984308    .1107775   .7436068   1.238203
-------------+---------------------------------------------------------
       _se_x |        100    .0444072    .0027773   .0391417    .052421
       _se_z |        100    .0453708    .0049049   .0347456    .064886
    _se_cons |        100     .044696     .002205   .0396771   .0515227
-------------+---------------------------------------------------------
     is_bx_1 |        100         .02    .1407053          0          1
     is_bz_1 |        100          .4     .492366          0          1
     is_bc_1 |        100         .42     .496045          0          1
The case of $x$ is just the same as before, and the rejection rates for the coefficient on $z$ and the constant remain about as excessive as before.

The next option is regress with clustered standard errors. This should account for that fixed unobserved effect:
. simulate _b _se, reps(100) seed(101):panel_re reg cluster(id) 

      command:  panel_re reg cluster(id)

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    1.000355    .0381124   .8919857   1.085083
        _b_z |        100    .9876579    .0994754   .7406453     1.2002
     _b_cons |        100    .9984308    .1107775   .7436068   1.238203
-------------+---------------------------------------------------------
       _se_x |        100    .0441533    .0051071   .0332627   .0670466
       _se_z |        100    .1049323    .0155818   .0745955   .1655501
    _se_cons |        100    .1045397    .0085235   .0841128   .1316457
-------------+---------------------------------------------------------
     is_bx_1 |        100         .04    .1969464          0          1
     is_bz_1 |        100         .05    .2190429          0          1
     is_bc_1 |        100         .06    .2386833          0          1
Now we are talking. The average standard errors for $z$ and the constant are about the same magnitude as the simulation-based ones (the benchmark).
Furthermore, now all rejection rates are just as expected, around 5%.

What if, instead, we control for those individual effects explicitly? We can do that using the panel FE estimator:
. simulate _b _se, reps(100) seed(101):panel_re xtreg fe 

      command:  panel_re xtreg fe

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    .9956048     .029711   .9376439   1.062583
      _sim_2 |        100           0           0          0          0
     _b_cons |        100    .9955147    .1434849   .6446086   1.326498
-------------+---------------------------------------------------------
       _se_x |        100    .0333269    .0011717   .0301809   .0364448
      _sim_5 |        100           0           0          0          0
    _se_cons |        100    .0316341     .000825   .0287964   .0339852
-------------+---------------------------------------------------------
     is_bx_1 |        100           0           0          0          0
     is_bc_1 |        100         .68    .4688262          0          1
So things now look different. Because $z$ is time-constant, we cannot estimate its coefficient when individual fixed effects are explicitly included in the model.
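To see why, recall that the FE (within) estimator demeans every variable by individual: $$ y_{it} - \bar{y}_i = a_1 (x_{it} - \bar{x}_i) + a_2 (z_i - \bar{z}_i) + (v_{it} - \bar{v}_i) $$ Since $z_i = \bar{z}_i$, the term $a_2(z_i - \bar{z}_i)$ is identically zero, so $a_2$ cannot be identified. The individual effect $e_i$ is removed the same way, which is precisely the point of the estimator.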

It is slightly surprising that we never reject the null $H_0:\beta_x=1$! While some of this is due to the limited number of repetitions (if I repeat the exercise with more repetitions, say 1,000, I do observe a rejection rate below 5%), it is also important to notice that the standard errors for the panel FE estimator are somewhat smaller than the OLS clustered standard errors.
This can be considered a gain in efficiency for the estimation of $\beta_x$.
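If you want to verify this yourself, rerunning with more repetitions is a one-line change (it will, of course, take longer to run):
. simulate _b _se, reps(1000) seed(101):panel_re xtreg fe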

Unfortunately, there is no free lunch! While it seems we can estimate $\beta_x$ more efficiently, we can no longer say the same about the constant. In fact, its simulation-based standard error increases, and the rejection rate of the null increases as well.

Finally, to finish this potpourri of methodologies, we can use a panel random effects model:
. simulate _b _se, reps(100) seed(101):panel_re xtreg re 

      command:  panel_re xtreg re

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)
. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)
. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)
. sum, sep(3)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    .9961247    .0289081   .9367011   1.064042
        _b_z |        100    .9877817    .0993992   .7413588   1.199385
     _b_cons |        100    .9984448    .1109847   .7435786   1.240776
-------------+---------------------------------------------------------
       _se_x |        100    .0331599     .001161    .030052   .0362103
       _se_z |        100    .1067879    .0120344   .0871601   .1579433
    _se_cons |        100    .1048667    .0087699   .0846491   .1327749
-------------+---------------------------------------------------------
     is_bx_1 |        100           0           0          0          0
     is_bz_1 |        100         .03    .1714466          0          1
     is_bc_1 |        100         .06    .2386833          0          1
And this seems to be the model that, under the strict assumptions of the DGP, performs best among all the strategies.

Alright, so there is still one more contender: a sixth methodology known as the correlated random effects (CRE) model.

To be fair, this model is not necessary here, because we know the time-fixed errors are uncorrelated with the other characteristics. But that is unknown when analyzing real data.
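The idea behind the CRE approach is to allow the individual effect to depend linearly on the within-person mean of the time-varying regressor, and to control for that mean directly: $$ e_i = \gamma \bar{x}_i + u_i, \qquad \bar{x}_i = \frac{1}{T} \sum_{t=1}^{T} x_{it} $$ Controlling for $\bar{x}_i$ (the variable m_x in the code below) absorbs any correlation between $e_i$ and $x_{it}$ that operates through the mean; in our DGP, $\gamma = 0$, so this control is harmless.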

To apply this method, however, I'll need to modify my simulation program slightly:
. program panel_cre, eclass
  1.         clear
  2.         set obs 100
  3.         gen id=_n
  4.         expand 10
  5.         gen x = rnormal()
  6.         gen z = rnormal()
  7.         sort id, stable
  8.         by id:replace z=z[1]
  9.         gen e = rnormal()
 10.         gen v = rnormal()
 11.         by id:replace e=e[1]
 12.         gen y=1+x+z+e+v
 13.         xtset id
 14.         by id:egen m_x=mean(x)  // <-- Here we estimate the within person mean characteristic
 15.         xtreg y x z m_x, re   // and control for it in the RE effects model.
 16. end

. simulate _b _se, reps(100) seed(101):panel_cre 

      command:  panel_cre

Simulations (100)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
..................................................    50
..................................................   100

. gen is_bx_1 = !inrange(1,_b_x-_se_x*1.96,_b_x+_se_x*1.96)

. gen is_bz_1 = !inrange(1,_b_z-_se_z*1.96,_b_z+_se_z*1.96)

. gen is_bc_1 = !inrange(1,_b_c-_se_c*1.96,_b_c+_se_c*1.96)

. sum, sep(4)

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        _b_x |        100    .9956048     .029711   .9376439   1.062583
        _b_z |        100    .9863926    .1006587   .7307997   1.207344
      _b_m_x |        100     .051984    .3436732  -.6213726   1.080986
     _b_cons |        100    .9982991    .1093144   .7438877   1.214026
-------------+---------------------------------------------------------
       _se_x |        100    .0333269    .0011717   .0301809   .0364448
       _se_z |        100    .1072043    .0120987    .087363   .1584293
     _se_m_x |        100    .3380803    .0343612   .2572779   .4165666
    _se_cons |        100    .1055241    .0090165   .0849954   .1330222
-------------+---------------------------------------------------------
     is_bx_1 |        100           0           0          0          0
     is_bz_1 |        100         .04    .1969464          0          1
     is_bc_1 |        100         .06    .2386833          0          1
And this last estimation strategy also performs well. Its rejection rates are comparable to those of the panel FE and panel RE estimators.
Something to note, however, is that it will perform well only if you use a random effects model, or clustered standard errors (when using OLS). Otherwise, the results will be just as bad as with the simple OLS approach.
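For instance, inside the same simulation program, the pooled OLS variant of the CRE regression would look like this:
. * hypothetical pooled-OLS variant of the CRE regression above;
. * the within-person correlation induced by e still requires clustering
. reg y x z m_x, cluster(id)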

And of course, it will do better than RE if $e$ is correlated with the time-varying explanatory variables in the model.

Conclusions

So to cluster or not to cluster?
I started with this question to motivate the decision to use clustered standard errors when estimating a model in which you suspect there are unobserved factors that are fixed along one particular dimension.

From the simulation evidence, I have shown that if you have access to panel data, and you believe there are individual unobserved effects that are fixed over time, it could be a mistake not to report clustered standard errors. Even -robust- will not do much. You will be overconfident, at least for those variables that are fixed across time.

In cases like this, you will do much better by either controlling for the fixed effects explicitly (for example, using the panel FE estimator) or implicitly (random effects), or, at the very least, correcting the standard errors for their presence (clustered standard errors).

These conclusions are by no means exhaustive. However, they are a good starting point for building different simulation settings to test what effect a change in a particular assumption would have on your results, and how a particular estimation methodology could handle that problem.

If you are interested in a more in-depth discussion of when to (or not to) use clustered standard errors, you may find the work of Abadie et al. (2017) interesting. See the citation below.

And as always, comments are welcome.

References

Abadie, A., S. Athey, G. W. Imbens, and J. M. Wooldridge. 2017. When Should You Adjust Standard Errors for Clustering? NBER Working Paper 24003. Cambridge, MA: National Bureau of Economic Research.

Abowd, J. M., F. Kramarz, and S. Woodcock. 2008. Econometric analyses of linked employer-employee data. In The Econometrics of Panel Data: Fundamentals and Recent Developments in Theory and Practice, ed. L. Mátyás and P. Sevestre, 3rd ed., 727-760. Berlin: Springer.